{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "AvxbZv9hQk2S" }, "source": [ "# Dataset curation - Feature scaling for time series data\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ShawnHymel/ai-nose/blob/master/ai-nose-dataset-curation.ipynb)\n", "\n", "In the paper \"Efficient BackProp\" [1], LeCun et al. show that we can train a more accurate model (e.g. an artificial neural network) in less time by standardizing (i.e. rescaling to a mean of 0 and unit variance) and decorrelating our input data.\n", "\n", "However, standardization assumes that the data is normally distributed (i.e. Gaussian). If our data does not follow a Gaussian distribution, we should perform normalization instead [2], where we subtract the minimum and divide by the range to produce values between 0 and 1.\n", "\n", "Create a directory */content/dataset* and upload your entire dataset there. Run through the cells in this notebook, following all of the directions, to analyze the data and create a curated dataset. If you perform normalization or standardization on any dimension, you will need to copy the mean, standard deviation, minimum, and range arrays into your inference code (i.e. to preprocess the data before running inference).\n", "\n", "[1] http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf\n", "\n", "[2] https://becominghuman.ai/what-does-feature-scaling-mean-when-to-normalize-data-and-when-to-standardize-data-c3de654405ed" ] },
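{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, the two scalings used in this notebook can be written as follows, where $\\mu$, $\\sigma$, $x_{min}$, and $x_{max}$ are a column's mean, standard deviation, minimum, and maximum:\n", "\n", "$$x_{std} = \\frac{x - \\mu}{\\sigma} \\qquad\\qquad x_{norm} = \\frac{x - x_{min}}{x_{max} - x_{min}}$$" ] },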
{ "cell_type": "markdown", "metadata": { "id": "xnALAnX4Ml61" }, "source": [ "## Step 1: Analyze the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ASsQkrDSFmt6" }, "outputs": [], "source": [ "import csv\n", "import os\n", "import shutil\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gpwB4SCFIuES" }, "outputs": [], "source": [ "### Settings\n", "HOME_PATH = \"/content\"            # Location of the working directory\n", "DATASET_PATH = \"/content/dataset\" # Upload your .csv samples to this directory\n", "OUT_PATH = \"/content/out\"         # Where output files go (will be deleted and recreated)\n", "OUT_ZIP = \"/content/out.zip\"      # Where to store the zipped output files\n", "\n", "# Do not change these settings!\n", "PREP_DROP = -1 # Drop a column\n", "PREP_NONE = 0  # Perform no preprocessing on column of data\n", "PREP_STD = 1   # Perform standardization on column of data\n", "PREP_NORM = 2  # Perform normalization on column of data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9x1FvOLJI2vX" }, "outputs": [], "source": [ "### Read in .csv files to construct one long, multi-axis time series\n", "\n", "# Store header, raw data, and number of lines found in each .csv file\n", "header = None\n", "raw_data = []\n", "num_lines = []\n", "filenames = []\n", "\n", "# Read each CSV file\n", "for filename in os.listdir(DATASET_PATH):\n", "\n", "    # Check if the path is a file\n", "    filepath = os.path.join(DATASET_PATH, filename)\n", "    if not os.path.isfile(filepath):\n", "        continue\n", "\n", "    # Read the .csv file\n", "    with open(filepath) as f:\n", "        csv_reader = csv.reader(f, delimiter=',')\n", "\n", "        # Read each line\n", "        for line_count, line in enumerate(csv_reader):\n", "\n", "            # Check header\n", "            if line_count == 0:\n", "\n", "                # Record first header as our official header for all the data\n", "                if header is None:\n", "                    header = line\n", "\n", "                # Check to make sure subsequent headers match the original header\n", "                if header == line:\n", "                    num_lines.append(0)\n", "                    filenames.append(filename)\n", "                else:\n", "                    print(\"Error: Headers do not match. Skipping\", filename)\n", "                    break\n", "\n", "            # Construct raw data array, making sure the number of elements matches the number of header labels\n", "            else:\n", "                if len(line) == len(header):\n", "                    raw_data.append(line)\n", "                    num_lines[-1] += 1\n", "                else:\n", "                    print(\"Error: Data length does not match header length. Skipping line.\")\n", "                    continue\n", "\n", "# Convert our raw data into a NumPy array\n", "raw_data = np.array(raw_data).astype(float)\n", "\n", "# Print out our results\n", "print(\"Dataset array shape:\", raw_data.shape)\n", "print(\"Number of elements in num_lines:\", len(num_lines))\n", "print(\"Number of filenames:\", len(filenames))\n", "assert len(num_lines) == len(filenames)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EPXyymW9LtPB" }, "outputs": [], "source": [ "### Plot scatter matrix to look for correlation\n", "\n", "# Convert NumPy array to Pandas DataFrame\n", "df = pd.DataFrame(raw_data, columns=header)\n", "\n", "# Create scatter matrix\n", "sm = pd.plotting.scatter_matrix(df, figsize=(15, 15))" ] }, { "cell_type": "markdown", "metadata": { "id": "2GuTOWqkYo9M" }, "source": [ "Notice the wide range of input values! We need to scale the columns into similar ranges so that plots like these are easier to compare. Before deciding how to scale each column, we will view the correlation matrix and then plot histograms to see how each column is distributed." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "u1JIjUVT5qwl" }, "outputs": [], "source": [ "### Show correlation matrix as colors\n", "\n", "# Create plot\n", "fig = plt.figure(figsize=(10, 10))\n", "ax = fig.add_subplot(111)\n", "im = ax.matshow(df.corr())\n", "\n", "# Add legend\n", "fig.colorbar(im)\n", "\n", "# Add x and y labels\n", "_ = ax.set_xticks(np.arange(len(header)))\n", "_ = ax.set_xticklabels(header)\n", "_ = ax.set_yticks(np.arange(len(header)))\n", "_ = ax.set_yticklabels(header)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qt9D2F_PMjdY" }, "outputs": [], "source": [ "### Examine the histograms of all the data\n", "\n", "# Create subplots\n", "num_hists = len(header)\n", "fig, axs = plt.subplots(1, num_hists, figsize=(20, 3))\n", "\n", "# Create histogram for each category of data\n", "for i in range(num_hists):\n", "    _ = axs[i].hist(raw_data[:, i])\n", "    axs[i].title.set_text(header[i])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x1JHQ3zbZTuP" }, "outputs": [], "source": [ "### Analyze the data\n", "\n", "# Calculate means, standard deviations, and ranges\n", "means = np.mean(raw_data, axis=0)\n", "std_devs = np.std(raw_data, axis=0)\n", "maxes = np.max(raw_data, axis=0)\n", "mins = np.min(raw_data, axis=0)\n", "ranges = np.ptp(raw_data, axis=0)\n", "\n", "# Print results\n", "for i, name in enumerate(header):\n", "    print(name)\n", "    print(\"  mean:\", means[i])\n", "    print(\"  std dev:\", std_devs[i])\n", "    print(\"  max:\", maxes[i])\n", "    print(\"  min:\", mins[i])\n", "    print(\"  range:\", ranges[i])" ] },
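{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want a numeric hint to go along with the histograms, the optional cell below is a minimal sketch that runs SciPy's D'Agostino-Pearson normality test (`scipy.stats.normaltest`, available by default on Colab) on each column. A small p-value (e.g. below 0.05) suggests that a column is not Gaussian, in which case normalization is likely the safer pick in Step 2. Treat it as a rough guide only: with many samples, the test flags even small departures from normality, so the histograms should still get the final say." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Optional: normality test to help choose between PREP_STD and PREP_NORM\n", "\n", "from scipy import stats\n", "\n", "# D'Agostino-Pearson test (null hypothesis: the column is Gaussian)\n", "for i, name in enumerate(header):\n", "    _, p_value = stats.normaltest(raw_data[:, i])\n", "    verdict = \"~Gaussian (consider PREP_STD)\" if p_value > 0.05 else \"non-Gaussian (consider PREP_NORM)\"\n", "    print(\"{}: p = {:.4f} -> {}\".format(name, p_value, verdict))" ] },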
{ "cell_type": "markdown", "metadata": { "id": "sWwyVZEv4M6B" }, "source": [ "## Step 2: Choose how to preprocess the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0nb5ZVm_QeqD" }, "outputs": [], "source": [ "### Choose preprocessing method for each column\n", "#  PREP_DROP: drop the column\n", "#  PREP_NONE: no preprocessing\n", "#  PREP_STD:  standardization (if data is Gaussian)\n", "#  PREP_NORM: normalization (if data is non-Gaussian)\n", "\n", "# Change this to match your picks!\n", "preproc = [PREP_NONE,   # Timestamp\n", "           PREP_NORM,   # Temperature\n", "           PREP_NORM,   # Humidity\n", "           PREP_DROP,   # Pressure\n", "           PREP_NORM,   # CO2\n", "           PREP_NORM,   # VOC1\n", "           PREP_NORM,   # VOC2\n", "           PREP_NORM,   # NO2\n", "           PREP_NORM,   # Ethanol\n", "           PREP_NORM]   # CO\n", "\n", "# Check to make sure we have the correct number of preprocessing request elements\n", "assert len(preproc) == len(header)\n", "assert len(preproc) == raw_data.shape[1]" ] }, { "cell_type": "markdown", "metadata": { "id": "KWxMyht0CQUK" }, "source": [ "## Step 3: Perform data preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BqrCbErD7vC6" }, "outputs": [], "source": [ "### Perform preprocessing steps as requested\n", "\n", "# Figure out how many columns we plan to keep\n", "num_cols = sum(1 for x in preproc if x != PREP_DROP)\n", "\n", "# Create empty numpy array and header for preprocessed data\n", "prep_data = np.zeros((raw_data.shape[0], num_cols))\n", "prep_header = []\n", "prep_means = []\n", "prep_std_devs = []\n", "prep_mins = []\n", "prep_ranges = []\n", "\n", "# Go through each column to preprocess the data\n", "prep_c = 0\n", "for raw_c in range(len(header)):\n", "\n", "    # Drop column if requested\n", "    if preproc[raw_c] == PREP_DROP:\n", "        print(\"Dropping\", header[raw_c])\n", "        continue\n", "\n", "    # Perform data standardization\n", "    if preproc[raw_c] == PREP_STD:\n", "        prep_data[:, prep_c] = (raw_data[:, raw_c] - means[raw_c]) / std_devs[raw_c]\n", "\n", "    # Perform data normalization\n", "    elif preproc[raw_c] == PREP_NORM:\n", "        prep_data[:, prep_c] = (raw_data[:, raw_c] - mins[raw_c]) / ranges[raw_c]\n", "\n", "    # Copy data over if no preprocessing is requested\n", "    elif preproc[raw_c] == PREP_NONE:\n", "        prep_data[:, prep_c] = raw_data[:, raw_c]\n", "\n", "    # Error if code not recognized\n", "    else:\n", "        raise Exception(\"Preprocessing code not recognized\")\n", "\n", "    # Copy header (and preprocessing constants) and increment preprocessing column index\n", "    prep_header.append(header[raw_c])\n", "    prep_means.append(means[raw_c])\n", "    prep_std_devs.append(std_devs[raw_c])\n", "    prep_mins.append(mins[raw_c])\n", "    prep_ranges.append(ranges[raw_c])\n", "    prep_c += 1\n", "\n", "# Show new data header and shape\n", "print(prep_header)\n", "print(\"New data shape:\", prep_data.shape)\n", "print(\"Means:\", [float(\"{:.4f}\".format(x)) for x in prep_means])\n", "print(\"Std devs:\", [float(\"{:.4f}\".format(x)) for x in prep_std_devs])\n", "print(\"Mins:\", [float(\"{:.4f}\".format(x)) for x in prep_mins])\n", "print(\"Ranges:\", [float(\"{:.4f}\".format(x)) for x in prep_ranges])" ] },
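{ "cell_type": "markdown", "metadata": {}, "source": [ "As noted at the top of this notebook, the same constants must be applied to raw readings before running inference. The cell below is a minimal sketch of that step: `preprocess_sample()` is a hypothetical helper (not used elsewhere in this notebook) that applies the saved means, standard deviations, minimums, and ranges to one raw reading with the dropped columns already removed. The sanity check at the end should print two matching rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Sketch: apply the saved constants to one raw sample, as inference code must\n", "\n", "# Preprocessing codes for the kept columns only, in prep_header order\n", "prep_codes = [p for p in preproc if p != PREP_DROP]\n", "\n", "def preprocess_sample(sample):\n", "    \"\"\"Scale one raw reading (one value per kept column) like the training data.\"\"\"\n", "    out = []\n", "    for i, x in enumerate(sample):\n", "        if prep_codes[i] == PREP_STD:\n", "            out.append((x - prep_means[i]) / prep_std_devs[i])\n", "        elif prep_codes[i] == PREP_NORM:\n", "            out.append((x - prep_mins[i]) / prep_ranges[i])\n", "        else:  # PREP_NONE\n", "            out.append(x)\n", "    return out\n", "\n", "# Sanity check: should match the first row of prep_data\n", "kept_cols = [i for i, p in enumerate(preproc) if p != PREP_DROP]\n", "print(preprocess_sample(raw_data[0, kept_cols]))\n", "print(list(prep_data[0]))" ] },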
{ "cell_type": "markdown", "metadata": { "id": "K-E0CfJaCSNc" }, "source": [ "## Step 4: Analyze newly preprocessed data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Nfxoua7-hG1e" }, "outputs": [], "source": [ "### Recreate the scatter matrix to look for correlation\n", "\n", "# Convert NumPy array to Pandas DataFrame\n", "df = pd.DataFrame(prep_data, columns=prep_header)\n", "\n", "# Create scatter matrix\n", "sm = pd.plotting.scatter_matrix(df, figsize=(15, 15))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mMbAP3tShNuQ" }, "outputs": [], "source": [ "### Show correlation matrix as colors\n", "\n", "# Create plot\n", "fig = plt.figure(figsize=(10, 10))\n", "ax = fig.add_subplot(111)\n", "im = ax.matshow(df.corr())\n", "\n", "# Add legend\n", "fig.colorbar(im)\n", "\n", "# Add x and y labels\n", "_ = ax.set_xticks(np.arange(len(prep_header)))\n", "_ = ax.set_xticklabels(prep_header)\n", "_ = ax.set_yticks(np.arange(len(prep_header)))\n", "_ = ax.set_yticklabels(prep_header)" ] }, { "cell_type": "markdown", "metadata": { "id": "ytiLtULjDPEq" }, "source": [ "## Step 5: Store preprocessed data in CSV files" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j-zn9JjSl0si" }, "outputs": [], "source": [ "### Delete output directory (if it exists) and recreate it\n", "if os.path.exists(OUT_PATH):\n", "    shutil.rmtree(OUT_PATH)\n", "os.makedirs(OUT_PATH)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HbIvoyilljEx" }, "outputs": [], "source": [ "### Write out data to .csv files\n", "\n", "# Go through all the original filenames\n", "row_index = 0\n", "for file_num, filename in enumerate(filenames):\n", "\n", "    # Open .csv file (newline='' avoids extra blank rows from the csv writer)\n", "    file_path = os.path.join(OUT_PATH, filename)\n", "    with open(file_path, 'w', newline='') as f:\n", "        csv_writer = csv.writer(f, delimiter=',')\n", "\n", "        # Write header\n", "        csv_writer.writerow(prep_header)\n", "\n", "        # Write contents\n", "        for _ in range(num_lines[file_num]):\n", "            csv_writer.writerow(prep_data[row_index])\n", "            row_index += 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WHwvKD94pOao" }, "outputs": [], "source": [ "### Zip output directory\n", "%cd {OUT_PATH}\n", "!zip -FS -r -q {OUT_ZIP} *\n", "%cd {HOME_PATH}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8FB-mGIGquk0" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "gas-sensor-dataset-curation.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }